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Abstract 

Multi-output Gaussian processes have received increasing attention during the last few years as a natural 
mechanism to extend the powerful flexibility of Gaussian processes to the setup of multiple output variables. 
The key point here is the ability to design kernel functions that allow exploiting the correlations between the 
outputs while fulfilling the positive definiteness requisite for the covariance function. Alternatives to construct 
these covariance functions are the linear model of coregionalization and process convolutions. Each of these 
methods demand the specification of the number of latent Gaussian process used to build the covariance 
function for the outputs. We propose in this paper, the use of an Indian Buffet process as a way to perform 
model selection over the number of latent Gaussian processes. This type of model is particularly important 
in the context of latent force models, where the latent forces are associated to physical quantities like protein 
profiles or latent forces in mechanical systems. We use variational inference to estimate posterior distributions 
over the variables involved, and show examples of the model performance over artificial data, a motion capture 
dataset, and a gene expression dataset. 


1 Introduction 

Kernel methods for vector-valued functions have proved to be an important tool for designing learning algorithms 
that perform multi-variate regression (Bonilla et ah, 2008), and multi-class classification (Skolidis and Sanguinetti, 
2011). A kernel function that encodes suitable correlations between output variables can be embedded in estab¬ 
lished machine learning algorithms like support vector machines or Gaussian process predictors, where the kernel 
function is interpreted as a covariance function. 

Different kernels for vector-valued functions proposed in recent years within the machine learning community, are 
particular cases of the so called linear model of coregionalization (LMC) (Journel and Huijbregts, 1978; Goovaerts, 
1997), heavily used for cokriging in geostatistics (Chiles and Delfiner, 1999; Cressie, 1993). Furthermore, the 
linear model of coregionalization turns out to be a special case of the so called process convolutions (PC) used in 
statistics for developing covariance functions (Ver Hoef and Barry, 1998; Higdon, 1998). For details, see Alvarez 
et al. (2012). 

Under the LMC or the PC frameworks, the way in which a kernel function for multiple variables is constructed, 
follows a similar pattern: a set of orthogonal Gaussian processes, each of them characterized by an specific 
covariance function, are linearly combined to represent each of the output variables. Typically, each of these 
Gaussian processes establishes the degree of smoothness that is to be explained in the outputs. In PC, the 
set of orthogonal Gaussian processes are initially smoothed through a convolution operation that involves the 
specification of the so called smoothing kernels. The smoothing kernel may be the impulse response of a dynamical 
system, or, in general, may correspond to the Green’s function associated to a differential equation. Gaussian 
processes that use a kernel constructed from a PC with a Green’s function as a smoothing kernel, have been 
coined by the authors of Alvarez et al. (2009) as latent force models. 

Despite its success for prediction, it is still unclear how to select the number of orthogonal Gaussian processes 
used for building the multi-output Gaussian process or a latent force model. Furthermore, in the context of 
latent force models where these orthogonal Gaussian processes may represent a physical quantity, like the action 
of a protein for transcription regulation of a gene or a latent force in a system involving masses and dampers, it 
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becomes relevant to undercover the interactions between the latent Gaussian processes and the output variables 
that are being modelled. 

In this paper, we use an Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2005, 2011) for model selection 
in convolved multiple output Gaussian processes. The IBP is a non-parametric prior over binary matrices, that 
imposes an structure over the sparsity pattern of the binary matrix. It has previously been used for introducing 
sparsity in linear models (Knowles and Ghahramani, 2011). We formulate a variational inference procedure 
for inferring posterior distributions over the structure of the relationships between output functions and latent 
processes, by combining ideas from Alvarez et al. (2010) and Doshi-Velez et al. (2009). We show examples of the 
model using artificial data, motion capture data and a gene expression dataset. 


2 Convolved multiple output Gaussian processes 

We want to jointly model D output functions {fd{'^)}d=i^ 'where each output /d(x) can be written as 

Q 

fdi^) 

where Gd{x — x') are smoothing functions or smoothing kernels, {ug(x)}^^]^ are orthogonal processes, and the 
variables {Sd,q}^^ measure the influence of the latent function q over the output function d. We assume that 
each latent process Uq{:x.) is a Gaussian process with zero mean function and covariance function fcq(x,x'). 

The model above is known in the geostatistics literature as a process convolution. If Gd{-) is equal to the Dirac 
delta function, then the linear model of coregionalization is recovered (Alvarez et al., 2012). Also, in the context 
of linear dynamical systems, the function Gd{-) is related to the so called impulse response of the system. 


[ Gd(x - x')Mq(x') dx', 
Jx 


( 1 ) 


2.1 Covariance functions 


Due to the linearity in expression (1), the set of processes {fd{x)}f^i follow a joint Gaussian process with mean 
function equal to zero, and covariance function given by 

Q 

^/<i./,j/(x,x') = COv[/d(x)/dAx')] =J2^d,qSd',qkf^J^^^{x,x'), 

9=1 


where we have defined 


/c/9jj^(x,x') = y y Gd(x-z)Gd/(x-z')fc,(z, z')dzdz'. (2) 

Besides the covariance function defined above, we are interested in the covariance function between /d(x) and 
Mq(x), which follows 


= cov[/^i(x)ug(x')] = / Gd(x-z)A:g(z,x')dz. (3) 

Jx 

For some forms of the smoothing kernel Gd{-), and the covariance function kq{-), the covariance functions 
and (x,x') can be worked out analytically. We show some examples in section 6.1. 


2.2 Likelihood model for multi-output regressiou 

In a multi-variate regression setting the likelihood model for each output can be expressed as 

yd(x) = /d(x) -k Wdix.), 

where each /d(x) is given by (1), and {wid(x)}^^ are a set of processes that could represent a noise process 
for each output. Assuming that each rcd(x) is also a Gaussian process with zero mean and covariance function 
fcu;d,u)rf(x, x'), the covariance function between yd(x), and yd'i'x!) is given by 

kyd,y^,{^,^') = fc/d,/dAx,x') + fcu,^.»Ax,x')(5d.d', 
where Sd,d' is the Kronecker delta. 
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2.3 Inference and hyper-parameter learning 

Let V = {^d,yd}d=i bs 9' dataset for a multi-output regression problem. We use X to refer to the set 
and y to refer to the set We assume that we have N data observations for each output. The posterior 

distribution p(u|y), and the predictive distribution for f» at a new input point x* can both be computed using 
standard Gaussian processes formulae (Rasmussen and Williams, 2006). 

Different methods have been proposed for performing computationally efficient inference and hyperparameter 
learning in multi-output Gaussian processes (Alvarez and Lawrence, 2011), reducing the computational complexity 
from 0{N^D^) to 0{NDM'^), where M is a user-specified value. 

3 The Indian Buffet process 

An open question in models like the one described in Eq. (1) is how to choose the number of latent functions Q. 

In this report, we will use an Indian Buffet process as a prior to automatically choose Q. 

The IBP is a distribution over binary matrices with a finite number of rows and an unbounded number of columns 
(Griffiths and Ghahramani, 2005). This can define a non-parametric latent feature model in which rows are related 
to data points and columns are related to latent features. The relationship between latent features and data points 
can be encoded in a binary matrix Z € where Zd,q = 1 if feature q is used to explain data point d and 

Zd,q = 0 otherwise. Each element Z^^q of the matrix Z is sampled as follows 

Vj ~ Beta(Q;, I), 

9 

1=1 

Zd,q ~ Bernoulli(7r,), 

where a is a real positive value, and tt^ is the probability of observing a non-zero value in the column q of the 
matrix Z, this is, the value controls the sparsity for the latent feature q. As we will see, in our proposed model, 
the value of a is related to the average number of latent functions per output. 

Using an IBP as a prior for a linear Gaussian model, the authors in Doshi-Velez et al. (2009), derive two variational 
mean field approximations, referred to as a “finite variational approach”, and an “infinite variational approach”. 
We adopt the latter approach, this is, even though that the update equations will be based on the true IBP 
posterior over an infinite number of features, for a practical implementation, we use a level of truncation Q 
as the maximum number of latent functions. In this approach, as shown in previous equations, v = 
are independent samples from a Beta distribution, while tt = are dependent variables obtained by 

multiplying the sampled values for Vj as shown before. Thus, in the factorised variational distribution of our 
mean field approach we use v as hidden variables with the prior given before. We induce sparsity over the 
sensitivities by pre-multiplying Zd^q with Sd,q-, as explained in the next section. 

4 Variational formulation for model selection 

The model selection approach presented here is based on the variational formulation for convolved multiple output 
Gaussian processes proposed by Alvarez et al. (2010), and the variational formulation for the Indian Buffet Process 
proposed by Doshi-Velez et al. (2009). We start by defining the likelihood as 

D 

p(y|u,X,0,S,Z) =Ar(y|f,Sw) = 
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where u = S = [S'd.g] & , Z = [Zd^q] G {0, and each output vector is defined as 


/d(xi) 


1 ^d,qSd,q Gd,q{^l z)7/g(z)dz 

/d(x2) 

= 

1 ^d,qSd,q Gd,q{^2 z)7/g(z)dz 

_/d(xAr)_ 


1 ^d,qSd,q Gd,q{^N z)'Ug(z)dz 


For each latent function Uq{-), we define a set of auxiliary variables or inducing variables G R, obtained when 
evaluating the latent function Uq at a set of M inducing inputs {zm}m=i- We refer to the set of inducing variables 
using u = Following ideas used in several computationally efficient Gaussian process methods, we work 

with the conditional densities p(it|u), instead of the full Gaussian process p{u). The conditional density of the 
latent functions given the inducing variables can be written as 


p(m|u) 


Q 

9=1 


K- 


^q-i i^Ug,u, 


-k. 


K- 




with 


i^Ug ,-U, (Z, Zl ) , ^Ug 


(z, Z 2 ),..., kug^ug (z, zm)]- The prior over u has the following form 


Q 

p(u)= 

9=1 


For the elements of S we use an spike and slab prior as follows (Knowles and Ghahramani, 2011) 

PiSd,q\Zd,q,Jd,q) =Zd,qAfiSd,q\0,-fJg) + (1 - Zd,q)5{Sd,q), 

where Zd^q are the elements of the binary matrix Z that follows an Indian Buffet Process Prior. This is different 
from Titsias and Lazaro-Gredilla (2011) where all the variables Zd^q are drawn from the same Bernoulli distribution 
with parameter tt. From the previous section we know that the prior for Zd^q is given by 

p{Zd,q\TTq) = Bernoulli(Zd,g | TT,). 


To apply the variational method, we write the joint distribution for Zd^q and Sd,q as 

Pi^Sd^qi Zd^q\'T^qi 'Jd,q) —PiSd^q\Zd^q, 'yd,q)p{Zd^q\TTq) 

= [TTqAf{Sd,q\0,-fd^l)f'‘’'’[{i - T^q)S{Sd,q)]^~^'^’'^ , 

leading to 

D Q 

p{S, Z|t^, 7 ) = ji [TrqJ\f{Sd,q\0, ld,l)f''''‘ [(1 - 7l9)^(5'd.9)]^"^‘''‘*' 

d—1q—1 

We assume that the hyperparameters 'yd,q follow a Gamma prior, 

D Q 

Ph) = n n Gamma( 7 d,,|a;) ,j, 6 ] ,j). 

d—1 <7—1 


Although not written explicitly, the idea here is that Q can be as high as we want to. In fact, the IBP prior 
assumes that Q —>■ 00 , but in our variational inference implementations the value of Q is fixed and it represents 
the truncation level on the IBP (see Doshi-Velez et al. (2009)). According to our model, the complete likelihood 
follows as 


p(y,X,M, S,Z,’l>, 7 , 6 ») 


=p(y|X, M, S, Z, V, 7 , e)p{u\e)p{S, Z| 7 , v)p{v)p{‘j), 
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where 0 are the hyperparameters regarding the type of covariance function (see section 6 . 1 ). For the variational 
distribution, we use a mean field approximation, and assume that the terms in the posterior factorize as 

Q D Q D Q Q 

<i(u)=n9w- <i(s.z)=nn Q{Sd,q\Zd,q)Q{^d,q)^ — 

q—l d—lq—1 d—lq—1 q—l 

Following the same formulation used by Alvarez et al. ( 2009 ), the posterior takes the form 

q{u, u, S, Z, V, 7) = p{u\u)q{u)q{S, Z)q{-y)q{v). 


The lower bound that needs to be maximized, Fv(g(u), <7(8, Z), 9(7), <7(7^)), is given as (Bishop, 2006 ) 

' p(y|X, u, S, Z, u, 7, 0 )p(m|u)p(u)p(S, Z\v, 'r)p{v)p{'y) 


j g(M,U, S,Z,7J,7)log 


dw du dS dZ dt^ d7. 


g(u,u,S,Z,u,7) 

By using standard variational equations (Bishop, 2006 ), it can be shown that the lower bound Fy is given as 
Fv = \ log |A| + i log |Ku,u| - i log |S| - i tr(S-Vy^) 


D Q 


^ D Q ^ 

- 9 E E nZd,A] tr - - log 2^ ^ ^ F[Za,,] + E E Eflog 

d—1 q—l 


D Q 

EE 

d—1 q—l 


1 




'=1 q=l 

^ E E - log ^ E E '^F[Zd,qSlg, 

^ d=l q=l ^ d=l q=l ^d,q 

E E(1 “ E[Zd,g]) E[l0g(l - TT,)] - X! E l0gE(«X9) + E E “I9 log 

d—1q—l d—1q—l d—1q—l 

E E -1) - log -EE + (a -1) E [^{Tql) - +Tg2)] 

’=1 9=1 ^d,q q=l 

D Q 

+ EE[“^‘'’‘?l0g^'^.9 - (1 - %.9)l0g(l - Vd,q)] 
d—1 q—l 


d=l 9=1 
D Q 

+ EE'1^.9 

d=l 9=1 
Q 


ilogud,, + i(l + log( 27 r)) 


+ E log - ("^92 - 1)V’(t-92) + (Tql + Tq 2 - 2 )l/;(r,i + T^z) 

+ EE [logr(aX,) - (<5 - ^)'<P{.aXq) - log6X9 + «X9 


d=l 9=1 

where Kf,u = Ezs 0 Kf,u and A = A + Besides, Ezs is a block-wise matrix with blocks 

E[^d,95'd,5]lArxAr, A = Ku,u -h Ku,u, Ku.u = M 0 , with M being a block-diagonal ma¬ 

trix with blocks IjvxAfj and Kf,u = Vzs © Kf.u, with Vzs being a block-wise matrix with blocks given by 
(F[Zd^qS'^ Inxn- The operator 0 refers to an element-wise product, and it is also known as 

the Hadamard product. While, ' 0 (-) and F(-) are the digamma and gamma function, respectively. It is important 
to notice that the first six terms of the lower bound defined above have the same form as the lower bound found 
in Alvarez et al. ( 2010 ); Titsias ( 2009 ). 

To perform variational inference, we obtain the update equations for the posterior distributions q{u), <7(8, Z), 
(7(7), and q{v) from the lower bound given above. The update equations are included in appendix A, while the 
mean and the variance for the predictive distribution p(y»|y, 0 ) appear on appendix B. 
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5 Related work 


Convolved multiple output Gaussian Processes have been successfully applied to regression tasks, such as motion 
capture data, gene expression data, sensor network data, among others. In the context of latent force models, 
multiple output Gaussian processes can be used to probabilistically describe several interconnected dynamical 
systems, with the advantage that the differential equations that describe those systems, and the data observations, 
work together for accomplishing system identification (Alvarez et ah, 2013). Multiple output Gaussian processes 
are commonly trained assuming that each output is fully connected to all latent functions, this is, the value of 
each hidden variable Zd.q is equal to one for all d, and q. 

Several methods have been proposed in the literature for the problem of model selection in related areas of multiple 
output Gaussian processes. For example in multi-task learning, a Bayesian multi-task learning model capable 
to learn the sparsity pattern of the data features base on matrix-variate Gaussian scale mixtures is proposed in 
Guo et al. (2011). Later, a multi-task learning algorithm that allows sharing one or more latent basis for task 
belonging to different groups is presented in Kumar and III (2012). This algorithm is also capable of finding the 
number of latent basis, but it does not place a matrix variate prior ever the sensitivities. 

In a closely related work in multi-task Gaussian processes (Titsias and Lazaro-Gredilla, 2011), the problem 
of model selection was approached using the spike and slab distribution as prior over the weight matrix of a 
linear combination of Gaussian processes latent functions. The inference step is performed using the variational 
approach. 


6 Implementation 

In this section, we briefly describe the covariance functions used in the experiments. The hrst type of covariance 
function is based on a convolution of two exponential functions with squared argument. The second type of 
covariance functions is based on the solution of ordinary differentials equations (ODE), and each data point is 
linked to a time value. 


6.1 Covariance functions 

We use three different types of kernels derived from expressions kfi ti (x, x') in (2), and ktt ^ (x, x') in (3). In 
turn, these expressions depend of the particular forms for Gdi'), and kg(-,-). 


6.1.1 General purpose covariance function 

Here we present a general purpose covariance function for multi-output GPs for which x. G X G If we assume 
that both the smoothing kernel Gd{-) and kq{-,-) have the following form 


fc(x, x') 


|p|l/2 



x')^P(x-x') , 


where P is a precision matrix, then it can be shown that the covariance function kti ti (x, x') follows as 

J d d' 






exp 


--(x-x')^(Pl,,)-'(x-x') 


where Fdd' = ^ Matrices P^ and A, correspond to the precision matrices associated to Gd{-), 

and kq{-,-), respectively. For the experiments, we use diagonal forms for both matrices, where {Pd,i}i=i the 
elements for the diagonal matrix P,;, and {iq,i}i=i ^re the diagonal elements of the matrix Aq. In the following 
sections, we refer to this covariance function as the Gaussian Smoothing (GS) kernel. 


6.1.2 Latent force models 

Latent force models (LFM) can be seen as a hybrid approach that combines differential equations and Gaussian 
processes (Alvarez et ah, 2009; Alvarez et ah, 2013). They are built from convolution processes by means of a 
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deterministic function (which relates the data to a physical model), and Gaussian process priors for the latent 
functions. In the next two sections we show examples with a first order ordinary differential equation (ODEl), 
and a second order ordinary differential equation (ODE2). In both cases, we assume the latent functions to be 
Gaussian processes with zero mean and covariance functions given by 


kq{t, t') = exp 


[t-tT 

p 

G J 


(4) 


We can derive the covariance function for the outputs following (2). In this context, the smooothig kernel Gd{-) 
is known as the Green’s function, and its form will depend on the order of the differential equation. 


First order differential equation (ODEl) We assume that the data can be modelled by the first order 
differential equation given by 


d/d(t) 

dt 


Q 


+ Bdfd{t) = ^ Sd,qUq{t), 
9=1 


(5) 


where Bd is the decay rate for output d. Solving for fd(t) in Equation (5), we get a similar expression to the one 
obtained in equation (I), where the smoothing kernel Gd{-) (or the Green’s function in this context) is given by 

Gdit - t') = exp [-Bd{t - t')] . 

Using the above form for the smoothing kernel Gd{-), and the covariance function kq(-^ ■) given in (4), we derive 
the expression for kf9 ji^{t,t') using (2), which is (Lawrence et ah, 2006) 


fQ 

Jd'Jd' 

where hq{Bd', Bd,t',t) is defined as 

exp(^L') 


V^lq 


[hq{Bd<,Bd,t' ,t) + hq{Bd,Bd',t,t')], 


hq{Bd',Bd,t',t) =— —exp(-Sd/t')i exp{Bd't) 


Bd + Bc^ 

- exp{-Bdt) 


erf 


t' - t 


Vq,d‘ 


erf I — + Vq^d‘ 


erf ( ^- Vq^d' ) + eT:i{vq^d') 


( 6 ) 


with Vq^d = IqBd/^, and erf(-) is the error function defined as 

I 

erf(z) = — exp (—dt. 

271 Jo 

In order to infer the latent functions Uq(t) related in (5), we calculate the cross-covariance function between fd{t) 
and Uq(t) using (3), as follows 


(t, t') =Sd,q^^^ exp{vla) exp [-Bd{t - t')] 


erf 


t-t' \ Jt' 

- Vq,d M- erf — -I- Vq^d 


Second order differential equation (ODE2) In this scenario, we assume that the data can be explained 
using a second order differential equation related to a mechanical system 

+ Gd^^^^ + Bdfdit) =^Sd,qUq{t), (7) 

9=1 

where {md}d=i are mass constants, {Gd}d=i are damper constants, and are spring constants. Without 

loss of generality, the value of the mass is set to one. Now, assuming initial conditions equal to zero, the 
solution for the Green’s function associated to (7) is given by 

Gd{t -t') = — exp \-ad{t - t')] sin [uJd{t - t ')], 


7 



















where ad is the decay rate and ujd is the natural frequency. Both variables are defined as 


ad = 


Cd 


UId = 


V^Bd-Ci 


2 ’ 2 
It can be shown that kti ci reduces to (Alvarez et ah, 2009) 

=Ko[hq{^d','ld,t,t') + hq{-fd,Jd',t',t) + hq(nd',ld,t,t') + hgi^Jd, 'Id' , t', t) 

- hq{fyd',ld,t,t') - hq{jd,^d',t',t) - hq{'yd',Jd,t,t') - hq{'yd,'yd',t',t)], 

where Kq = 'Jd = ad + juJd, Jd = ad — joJd and hq[-) is the function defined in (6). Additionally, if ujd and 

uJd' take real values, the expression above simplifies as 

=‘^KoRe[hq{-id',ld,t,t') + hq{id,ld',t',t) - hqi'yd',-fd,t,t') - hq{'yd,ld',t',t )], 

where Re(-) refers to the real part of the argument. For the cross-covariance it can be shown that 

the solution for (3) is 


^fd,uq ) — 


lqSd,q'/^ 

j^^d 


where 


'^q(.ld,t,t') = e 4 


''1 _ p-id(t-t') 


erf 


[Tq{jd,t,t') - Tq{-id,t,t')], 




6.2 Variational inference procednre 

The variational inference procedure can be summarized as follows. We give initial values to the parameters 
of each variational distribution, and initial values to the parameters of the covariance functions. We also set 
the values of a, and Q. An iterative process is then performed until a criterion of convergence is fulfilled. At 
each iteration, we update the moments for each variational distribution as shown in appendix A. Alongside, 
every ten iterations in the variational inference method, we estimate the parameters 0 of the kernel functions, 
by maximizing the lower bound Fy using the scaled conjugate gradient method. The derivatives ^ 

and are calculated using expressions similar to the ones obtained in Alvarez et al. (2009). We combine 

those derivatives with the derivatives of Ku,u, Ku.f, and Kf^f wrt 9. We use the software GPmat (https; 
//github. com/Shef f ieldML/GPmat) to train and test models based on latent force models. 

7 Results 

In this section, we show results from different datasets, including: artificial data, motion capture data, and gene 
expression data. For the artificial datasets, we are interested in recovering the known interconnection matrix (Z) 
between the latent functions and outputs. For the real datasets, we analyse the regression performance of the 
proposed method under different configurations. 


7.1 Synthetic Data 

To show the ability of the proposed model to recover the underlying structure between the output data and the 
latent functions, we apply the method to two different toy multi-output datasets. Each toy dataset is built by 
sampling from the model explained in section 4. 

Example 1: The first experiment is conducted using a GS covariance function (see section 6.1.1) and sample 
from the model with D = 3, Q = 2 and a = 1. For the smoothing kernels Gd{x,x'), we set the length-scales to 
Pi,i = 0.01, p 2 ,i = 1/120, and pap = 1/140. We use the following values for matrices Z, and S, 


Z = 


0 

1 


0 

1.48 

1 

0 

, S = 

-3.19 

0 

1 

0 


6.87 

0 
















For the covariance functions kq{x,x') of the latent functions, we choose the length-scales as Z14 = 0.1 and 
^2,1 = 0.2. Next, we sample the model and generate 30 data points per output, evenly spaced in the interval [0,1]. 
We assume that each process Wd(x) is a white Gaussian noise process with zero mean, and standard deviation 
equal to 0.1. 

The model is then trained using the proposed variational method with a maximum number of latent functions set 
to four. Additionally, for the variational distribution of latent functions, we set 15 inducing points evenly space 
along the output interval. 



(a) Hinton diagram toy 1 



(b) Ouput 1 toy 1 



(c) Ouput 2 toy 1 



(d) Ouput 3 toy 1 


Figure 1: Results for model selection for example 1. Hinton diagram for and, mean and two standard 

deviations for the predictions over the three outputs. 


Figure 1 shows the results of model selection for this experiment. We use a Hinton diagram to display the 
estimated value for E[Z], in Figure la. We notice from the Hinton diagram that there are two main latent 
functions which are used by the model to explain the data. The first column of the Hinton diagram corresponds 
to the second column of matrix Z, while the second column of Hinton diagram corresponds to the first column 
of matrix Z. The posterior mean functions for each output closely approximate the data, as shown in Figures lb 
to Id. 


Example 2: The second experiment is conducted using an ODE2 covariance function (Alvarez et ah, 2009). 
We generate data using D = 3, Q = 2 and a = 1. For each differential equation, we have the following values 
for the springs: Bi = 4, H2 = 1, and B 3 = 1. The values for the dampers are Ci = 0.5, C 2 = 4, and C 3 = 1. 
Matrices Z, and S are set to the following values 


1 

0 


- 2.61 

0 

0 

1 

s = 

0 

2.66 

1 

0 


1.10 

0 


The length-scales for the covariance functions of the latent Gaussian processes were set to h^i = 0.2, and h,! = 0.4. 
We sample from the model, and generate 50 data points per output evenly spaced across the interval [0,5]. We 
truncate the number of latent functions to Q = 4. 



(a) Hinton diagram toy 2 



(b) Output 1 toy 2 




(c) Output 2 toy 2 (d) Output 3 toy 2 


Figure 2: Results for model selection in toy example 2. In Figure 2a, Hinton diagram for E[Zci,g]. In Figures 2b 
to 2d, mean and two standard deviations for the predictive distribution over the three outputs. 


We perform the same evaluation as the one performed in example 1. The Hinton diagram in Figure 2a, shows 
the values for E[Zd_g]. We recover the structure imposed over the original matrix Z: columns first and third in 
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the Hinton diagram recover the ones and zeros in Z, whereas columns second and fourth have entries with very 
small values. 

For this experiment, we used ydit) = fd{t) (we did not use an independent process Wd{t)). The mean predictive 
function together with the actual data is shown in Figures 2b to 2d. 

In the following sections, we evaluate the performance of the proposed model selection method in human motion 
capture data and gene expression data. 

7.2 Human motion capture data 

In this section, we evaluate the performance of the proposed method compared to the Deterministic Training 
Conditional Variational (DTCVAR) inference procedure proposed in Alvarez et al. (2009). DTCVAR also uses 
inducing variables for reducing computational complexity within a variational framework, but assumes full connec¬ 
tivity between the latent functions and the output functions (meaning that Zd^q = 1, for all q, and d). Parameters 
for all the kernel functions employed are learned using scaled conjugate gradient optimization. 

7.2.1 Performance of the model in terms of changing number of outputs and the truncation level 

We use the Carnegie Mellon University’s Graphics Lab motion-capture motion capture database available at 
http://mocap.cs.cmu.edu. Specifically, we consider the movement walking from subject 35 (motion 01). From 
this movement, we select 20 channels from the 62 available (we avoid channels where the signals were just noise 
or a straight line). Then, we take 45 frames for training and the rest 313 frames are left out for testing. Table 1 
shows a performance comparison between our proposed model and a model trained using DTCVAR, taking into 
account different types of covariance functions. Performance is measured using standardized mean square error 
(SMSE) and mean standardized log loss (MSLL) over the test set for different combinations of number of outputs 
(D) and number of latent functions (Q). Similar results are obtained by both models using the GS covariance 
function. Even-though, the model based on DTCVAR outperforms the proposed one in the cases D = 10,Q = 7, 
and D = 15,Q = 9, the DTCVAR approach uses all latent functions, while the IBP approach uses only two latent 
functions, as shown in Figure 3. 


Inference setup 

Measure 

D=5, Q=4 

D=10, Q=7 

11=15, Q=9 

D=20, Q=14 

GS 

SMSE 

0.241 

0.1186 

0.3072 

0.3796 


MSLL 

-1.1334 

-1.6924 

-1.0322 

-0.9196 

IBP -b GS 

SMSE 

0.1538 

0.3267 

0.3494 

0.3605 


MSLL 

-1.6083 

-1.0140 

-0.8819 

-0.8118 


Table 1: Standardized mean square error (SMSE) and mean standardized log loss (MSLL) for different number 
of outputs and latent functions. 

For the other two cases (D = 5, Q = 4, and D = 20, Q = 14), the proposed method presents a similar performance 
compared to the model estimated by DTCVAR using all latent functions, showing that to obtain an adequate 
approximation of the output data we do not require to use the maximum number of the latent functions. 

7.2.2 Performance for different kernel functions 

In this section, we compare the performance of the proposed model and the one trained using DTCVAR with 
different covariance functions over the same dataset. In this case, we consider the walking movement from subject 
02 motion 01. From the 62 channels, we select 15 for this experiment. We assume a maximum of nine latent 
functions, and make a comparison between the GS and the ODE2 covariance functions. The latter is used because 
human motion data consists of recordings of an skeleton’s joint angles across time, which summarize the motion. 
We can use a set of second order differential equations to describe such motion. Table 2 reports the SMSE and 
MSLL measures for each type of training method and covariance function. Our proposed method presents better 
results, with the ODE2 kernel being the kernel that best explains the data. 
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(b) D=10, Q=7 


(c) D=15, Q=9 


(d) D=20, Q=14 


(a) D=5, Q=4 

Figure 3: Hinton diagrams of E[Z] for each pair of outputs and latent functions tested in table 1. 



ODE2 

IBP -b ODE2 

GS 

IBP -b GS 

SMSE 

0.5463 

0.2087 

0.5418 

0.1790 

SMLL 

-0.6547 

-1.2725 

-0.7863 

-1.1993 


Table 2: Standardized mean square error (SMSE) and mean standardized log loss (MSLL) for different models 
and different kernel functions. 


Comparing the Hinton diagrams from both covariance functions (see Figure 4), we find similar results. For 
example, there is a similar composition of elements between columns 3 and 8 from the Hinton diagram of the 
ODE2 kernel, with columns 6 and 2 from the Hinton diagram of the GS kernel. Both covariance functions try to 
unveil a similar interconnection between outputs and latent functions. 



(a) IBP + ODE2 (b) IBP + GS 


Eigure 4: Hinton diagrams for the IBP variational approximation using (a) ODE2 and (b) GS covariance functions. 

Eigure 5 shows the Gaussian process mean and variance for the predictive distribution of six outputs from the 
model inferred from IBP + ODE2. In most of the predictions, the model explains the testing data points with 
adequate accuracy. 
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(e) Channel 3 Thorax (f) Channel 2 Head 


Figure 5: Mean (solid line) and two standard deviations (gray shade) for predictions over six selected outputs 
from IBP + ODE2 trained model. 



(d) CIpl/p21 (e) SESNl/hPA26 

Figure 6: Mean (solid line) and two standard deviations (grey shade) for the predictions over five gene expression 
levels from the IBP + ODEI model. 


7.3 Gene expression data: Tumour Suppressor Protein p53 

Gene expression data consist of measurements of the mRNA concentration of a set of genes. mRNA concentration 
for each gene is regulated by the so called transcription factor (TFs) proteins. In transcriptional regulatory 
networks, a TF or a set of TFs may act in a individually or collaborative manner, leading to complex regulatory 
interactions. 

Gene expression data can be related to a first order differential equation (Barenco et ah, 2006), with the same 
form given in equation 5. From this equation, and in the context of gene expression, the output fd{t) is the 
mRNA concentration of gene d, Bd is the linear degradation rate of fd{t), and u{t) is the concentration of the 
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TF. Two major problems arise from the analysis of gene expression, first to determine the interaction network 
and second to infer the activated transcription factor (Gao et ah, 2008). 

In this experiment, we use tumour suppressor protein p53 dataset from Barenco et al. (2006). This dataset is 
restricted to five known target genes: DDB2, BIK, TNFRSFlOb, CIpl/p21 and SESNl/hPA26. In Lawrence et al. 
(2006), a transcription factor is inferred from the expression levels of these five target genes using a covariance 
function build from a first order differential equation. Our aim is to determine the number of transcription factors 
(latent functions) and how they explain the activities of the target genes using the covariance function defined in 
6 . 1 . 2 . 



Figure 7: Hinton diagrams for the IBP variational approximation using ODEl covariance function. 

Results obtained from this dataset regarding the number of latent forces, concurred with the description given in 
Barenco et al. (2006), where there is one protein influencing the expression level of the genes analysed (see Figure 

7). 

8 Conclusions 

We have introduced a new variational method to perform model selection in convolved multiple output Gaussian 
Processes. Our main aim was to identify the relationship between the latent functions and the outputs in multiple 
output Gaussian processes. The proposed method achieved comparable results to the model that assumes full 
connectivity between latent functions and output functions. This makes our method suitable to applications 
where the complexity of the model should be reduced. The proposed model selection method can be applied in 
other applications that involve the use of a covariance function based on differential equations, such as inferring 
the biological network in gene expression microarray data. 

For the artificial dataset examples we found that the model selection method converges to a similar matrix for 
the interconnections between latent functions and outputs. We have illustrated the performance of the proposed 
methodology for regression of human motion capture data, and a small gene expression dataset. 
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A Computing the optimal posterior distributions 

In this appendix, we present the updates of variational distributions q{S, Z), q{u), ( 7 ( 7 ) and q{v). To do so, first, 
we rewrite the lower bound defined in section 4, as 
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Q 
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Additionally, S-w is the covariance matrix for the observation noise. 

A.l Updates for distribution g(u) 

Taking into account that (/(u) = ft ^(uq) and g(uq) = ^(uqlug, Ku .u )■ Then, it can be shown that the moment 

9=1 

updates are 
with 
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u, = (y, - y), 
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and 


Q 

y ~ E[Ug/]. 

9'=i.9Vj 

A.2 Updates for distribution ^(S, Z) 

We assume that the distribution 9 ( 8 , Z) is defined as 


D Q 

g(S,Z) = g(S|Z)g(Z) = nn q{Sd,q\Zd,q)q{Zd^q). 

d—1 q—1 


First, we calculate the update parameters for the distribution q{Zd^q), which is defined as 


Also, q{Sd,q\Zd^q) is defined as 

Q{Sd,q\Zd,q = 1 ) = ■^{Sd,q\^^Sd,q^^d,q)■ 

Thell it can be shown that the update for Vd^ is given by 
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While, the update for parameter ^Sd i i® calculated as 
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Finally, the update for rjd^i is given by 
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- log^Xi - 2 + t4d.i) ~ E[log(l - 7rd] + - ln(27reUd,d, 

I 1 


For computing Eq(„)[log(l — rii=i''^*)]; would need to resort to a local variational approximation (Bishop, 
2006) in a similar way to Doshi-Velez et al. (2009). 
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A.3 Updates for distribution 5 ( 7 ) 

This distribution is defined as 

D Q 

9 ( 7 ) = n n Gannna(7d.,|a]*^,6]_*^). 

d—1q—1 

It can be shown that the updates for the parameters 6^ q and ^ are given by 

bXq = lnZd,qSjj+bl^, 

and 

“Xg = 2 + “Xg- 


A.4 Updates for distribution q{v) 

We assumed that the optimal distribution for each q{vi) = Beta(ui|Tii, Ti 2 )- Given a fixed value for i, the updates 
for parameters th and Ti 2 are given by 
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B Predictive distribution 


For the predictive distribution, we need to compute 


P{y*\^)= f [ p(y*|6,n, S,Z)g(M, u)q(S,Z)dududSdZ. 

JS,Z Ju,u 


It can be demonstrated that the above integral is intractable. Since we are only interested in the mean and 
covariance for y*, we can still compute them using 
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we get 
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where ad,k = (^f^i Ku^.f^ Swjyj) ^ and ad = [ 0 ^, 1 ,..., OLd^q]. Notice that the in the expres¬ 

sion for ad,k must be computed at the test point x*. 


We need to compute now the second moment of y* under p(y»|-). Again, we can write 
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Finally, the covariance cov[ydyJ] would be given as 
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cov[ydyj ] = ^ ^ + S 
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i=l 


i=l 


Bear in mind, we have omitted * in the vectors above to keep the notation uncluttered. 
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